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Abstract — We introduce Tyler's M-estimator to robustly re- 
cover the underlying linear model from a data set contaminated 
by outliers. We prove that the objective function of this estimator 
is geodesically convex on the manifold of all positive definite 
matrices and have a unique minimizer. Besides, we prove that 
when inliers (i.e., points that are not outliers) are sampled from 
a subspace and the percentage of outliers is bounded by some 
number, then under some very weak assumptions a commonly 
used algorithm of this estimator can recover the underlying 
subspace exactly. We also show that empirically this algorithm 
compares favorably with other convex algorithms of subspace 
recovery. 



I. Introduction 

This paper is about the following problem: suppose we 
are given a data set X with inliers sampled from a low- 
dimensional linear model and some arbitrary outliers, can we 
recover the underlying linear model? The primary tool for this 
problem is Principal Component Analysis (PCA). However, 
PCA is very sensitive to outliers. Considering the popularity 
of linear modeling, an robust algorithm that find the underlying 
linear model will have many applications. 

This work introduces Tyler's M-estimator of covariance 
in |28l . and proves that the objective function is geodesically 
convex on the manifold of all positive definite matrices. 
Moreover, this work proves that when inliers are sample from 
a subspace L*, a commonly used algorithm for this estimator 
finds the underlying subspace L* under very weak conditions 
that almost only depend on the percentage of outliers. 



A. Notation and conventions 

We assume that we are given a data set X C M. D with N 
points. We define the projector 11^ as the D x D symmetric 
matrix such that Il| = 11^, and the range of II ^ is L. We 
define Pj, as the D x dim(L) projection matrix from M. D to 
L, or equivalently, any Pl such that 11^= P^P^. We use 
L to denote the orthogonal complement of L. 

We use X n L to express the set of points that lie both in 
X and the subspace L, and use X \ L to express the set of 
points that lie in X but not in the subspace L. We use \X\ to 
denote the cardinality of the set X, S+(D) to denote the set 
of all D x D semi-positive definite matrices and S++(Z?) to 
denote the set of all D x D positive definite matrices. 
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B. Main results 

In this paper we introduce the following estimator due to 
Tyler 



= arg min ^(Ti), where 

tr(S) = l,S=S T ,SgS + + (r>) 



(1.1) 



F(E) 



1 



log(x T S- 1 a;) 



logdet(S), 



and we obtain S* by the limit of the sequence £( fc ) generated 
by the following iterative procedure in 



X 



L X 



(1.2) 



We will explain the motivation of this estimator as an M- 
estimator of covariance in Section H-Dl and show in Section ITTT1 
that the objective function F(H) is geodesically convex on 
S++(D), and under the condition dill- lb the sequence T,^ 
generated by (11.2b converges to the unique solution of (II.lt . 

When the inliers lie exactly on the subspace L*, then 
under some weak assumptions (almost only depends on the 
percentage of outliers) we can recover the L* exactly from 
linifc-^oo H^, which is a singular matrix with L* as its range. 

Theorem LI. If there exists a d-dimensional subspace 
such that 

*"^>i, (1.3) 



X 



D 



and the points in the set 3^1 = {Pl, x : x E X n £*} C M. d 
and yo = {Pl±x : x € X \ L*} C M. D ~ d lie in general 
positions respectively (i.e., any k points in 3^1 span a k- 
dimensional subspace for all k < d and any k points in J^o 
span a k- dimensional subspace for all k < D — d). Then the 
sequence S( fc ' generated by (11.21 ) converges to some S such 
that im(S) = L*. 

The condition of "general positions" is very weak — for 
example, when we choose inliers arbitrarily from a uniform 
distribution in a ball in L*, or from Gaussian measure in 
and choose outliers arbitrarily from a uniform measure in a 
ball in M. D , or from Gaussian measure in ~R D this condition 
holds with probability 1. 

We remark that when the ambient dimension D — > oc 
and the dimension of subspace L* is kept as the constant d, 
then -0 approaches and the required percentage of inliers in 
Theorem II. 1 1 approaches 0. This property makes Theorem II. 1 1 
particularly strong for high-dimensional data set with a low- 
dimensional structure. 



2 



C. Previous work 

The robust estimator for covariance has been well studied 
in the statistical literature, which is related topic to robust 
linear modeling since we can recovery the linear model by 
the principal components of the estimated covariance. The M- 
estimators, L-estimators, MCD/MVE estimators and Stahel- 
Donoho estimator have been proposed; for a complete review 
we refer the reader to 1120] Section 6]. However most of these 
methods are not convex (or the convexity is not analyzed), so 
the algorithms are intractable (or have unknown tractability). 
it is possible that the only exception is M-estimators: the 
convergence of its algorithm has been analyzed in ||T3l . and 
we will show later that as a special M-estimator, F(E) is 
geodesically convex in the space of positive definite matrices 
(it is also shown in 11321 ). 

There are other methods that recover the linear model 
without estimating the robust covariance, such as Projection 
Pursuit Ifl8l , 0], [221 Section 2], which find the principal 
component directly by optimization on a sphere. Another 
common strategy is to fit the linear model by PCA after 
removing possible outliers 11271 . 11331 . However these methods 
are still nonconvex. 

Some recent works on robust linear modeling focus on 
convex optimization and tractable algorithms 11341 . 11221 . 11351 . 
[171 . Similar to Theorem 11.11 these works provide conditions 
for exact subspace recovery. We remark that these conditions 
are more complicated than the condition in Theorem 11.11 
since they assume an "incoherence condition" that requires 
the inliers to be spread out on the subspace L*, which is not 
required in Theorem [TT] These kind of conditions are required 
in flU Theorem 1], [35] (6)(7)]. In [J7] Theorem 1.1] it 
is shown that the exact recovery holds with high probability 
when , , , 

— ^— >C Q + C 1 — —, 

which is a simple condition and very similar to the condition in 
( 11.31 ). However this condition is obtained under the assumption 
that inliers and outliers are both sampled from Gaussian dis- 
tributions. In another recent work, Soltanolkotabi and Candes 
proved that sparse subspace clustering (SSC) algorithm (]7] 
can recover multiple subspaces with high probability, but 
this theory also have probabilistic assumptions: it assumes 
that inliers and outliers are both i.i.d sampled from uniform 
measures on unit spheres 11261 Theorem 1.3]. 

We remark that our condition (11.3) sometimes can be more 
restrictive than the corresponding conditions of other convex 
methods. For example, when outliers have small magnitude 
and concentrate around the origin, then the conditions in 11351 
Theorem 2] can tolerate more outliers. Similarly, the con- 
ditions in [34l Theorem 1] and [26] Theorem 1.3] can also 
tolerate more outliers than ( 11.3) under some settings. The 
advantage of our condition is that it is deterministic and 
simple, and empirically it is also usually less restrictive than 
the conditions in [341 Ell. l35l. fPTI. 

D. M-estimators 

In this section we show that the estimator ( II. It can be 
considered as a special M-estimator of covariance and gives 



background of the current research on this estimator. We start 
with the motivation: it is well known that the empirical covari- 
ance is the MLE estimator for the covariance when we assume 
that all x € X are i.i.d. drawn from a Gaussian distribution. 
As a natural generalization, M-estimators of covariance |fj"9ll , 
[Toll . I2TI consider the generalized distribution 

C{p)e- p ^ Ti: ' 1 ^ I Vdet(E), (1.4) 

where C(p) is a normalization constant, chosen so that the 
integral of the distribution is equal to one. It is a generalization 
since when p(x) = x, (11 -4b gives Gaussian distribution. When 
data points are i.i.d. sampled from the distribution (11.4) , the 
corresponding MLE estimator of covariance is called M- 
estimator, which minimizes 

^ £ p{x T Jr 1 x) + \ logdet(S). (1.5) 

xex 

The objective function can be considered as the M- 

estimator when p(x) = y log(x). While for this choice of p 
the function in ( 1I.4| | is not a distribution, it can be considered 
as the limit of the following multivariate student distribution 
as v -> GO] page 187]: 

T[{v + D)/2] 

r(^/2)^ D / 2 7 r D / 2 v / ditE[l + ^xT^x^+D)/ 2 ' 

Since student distribution has a heavy tail, it is expected that 
this estimator should be more robust to outliers. 

The reason that we enforce the condition tr(S) = 1 in dl.ll ) 
is the scale invariance property of F(£): for any constant 
c > and S e S ++ (D), we have 

F(E) = F(cE). (1.6) 

This simple fact can be easily verified, but will be repetitively 
used in the analysis later. 

Tyler and Kent investigated the estimator (11.11 ) implicitly, 
by solving the equation F'(~E) = Il28l. ITT21. They obtained 
the uniqueness of the solution (up to scaling) to F'(E) = 
and showed that the algorithm (11.21 converges to this solution 
in [121 Theorem 2], under the assumption (IIII. lb . This result 
is almost equivalent Theorem IIII. II and Theorem IIII. 41 in this 
work, except that we consider the minimization of F(H) 
directly and also show the existence of the minimizer. An 
interesting claim in [281 is that, this estimator is the "most 
robust" estimator of the scatter matrix of an elliptical distri- 
bution in the sense of minimizing the maximum asymptotic 
variance. 

The geodesical convexity of the objective function F(H) 
was discovered later. In Auderset et al. showed that the 
function is geodesically convex on the space {Eg S++(D) : 
det(S) = 1}. After finishing this work, we learned that 
the geodesical convexity of F(E) on the space S++(D) 
was recently independently investigated by Wiesel in [32l 
Proposition 1]. Wiesel also extended the convex analysis to 
regularized Tyler's M-estimator in 1132) and generalized LSE 
(logarithm of a sum of exponents) functions and the estimation 
of Kronecker structured covariance in ll30l . JUJ. 
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E. Contributions and the structure of this paper 

The main contribution in this work is that, we introduce 
Tyler's M-estimator for subspace recovery, and justify this 
estimator by showing that the algorithm (II. 2b can recover the 
underlying subspace exactly under rather weak assumptions on 
the distribution of data points. Besides, we also apply geodesic 
convexity and majorization-minimization argument to show 
the existence and the uniqueness of the minimizer, and the 
convergence of the algorithm. While these two facts are also 
observed in J32), the analysis in this paper is more careful 
and therefore it proves the uniqueness of the minimizer and 
the pointwise convergence of the algorithm. 

The paper is organized as follows. In Section [El] we intro- 
duce the background on the geometry of S ++ (D) and geodesic 
convexity. With this background we prove the uniqueness of 
the solution to ( II. 11 ) and the convergence of the algorithm 
( 11.21 ) in Section [HI] Then we perform some experiments 
that describe the performance of Algorithm |L2] and verify 
Theorem [IT| in Section |IV] Technical proofs are shown in 
the Appendix. 

II. Preliminaries 

Our analysis relies on basic concepts from the geometry 
of S++(D) and the geodesic convexity. For this purpose, in 
Section IH-AI we present a brief summary of the geometry of 
S+ (D) and in Section III-BI we introduce the definition and 
some properties of geodesic convexity. For more details we 
refer the reader to H, 11291 on the geometry of S++(Z?) and 
the geodesic convexity. 

A. Metric and geodesic on S++(D) 

The metric of S++(Z?) has been well studied in literature. 
Indeed, the trace metric in differential geometry ifTBI pg 326], 
natural metric in symmetric cone [H], El, affine-invariant 
metric 11241 . and the metric given by Fisher information matrix 
for Gaussian covariance matrix estimation 0251 give the same 
metric on S++(D). For Ei,E 2 <G S++(D), these metrics are 
defined by: 

dist(Ei,E 2 ) = ||log(E7*S 2 S^)|| F . (HI) 

Based on this metric, the unique geodesic 7e 1 e 2 (£) (0 < t < 
1) connecting Si and E 2 is given by J4] (6.11)]: 



(II.2) 



It follows that the midpoint of Si and S 2 is 

TfciS2 a (i) = sf(S~*S 2 S~*)3Sf. We remark that 

i _i _i i i 
Sj 2 (Sj 2 S 2 S 1 ! )5Sj 2 is also called the geometric mean of 

Si and S 2 Section 4.1]. 



B. Geodesic convexity 

Geodesic convexity is a natural generalization of the con- 
vexity to Riemannian manifolds [29, Chapter 3.2]. Given a 
Riemannian manifold Ai and a set A C Ai, we say a function 
/ : A M is geodesically convex, if every geodesic j xy of 



Ai with endpoints x, y € A (i.e., j xy is a function from [0, 1] 
to Ai with 7a;j,(0) = x and 7^(1) = y) lies in A, and 

/(7«v(*)) < (1 - *)/(*) + tfiv) for any x,y £ A and <t < 1. 

(H.3) 

Following the proof of 11231 Theorem 1 . 1 .4], for a continuous 
function, the geodesic midpoint convexity is equivalent to the 
geodesic convexity: 



Lemma II. 1. Let f : A 



be a continuous function. If 



fhxy{-)) < for any x^y&A (II.4) 

then f is a geodesically convex function. 



III. Property of the objective function and the 

ALGORITHM 

In this section we study the properties of the objective 
function F(S) and the algorithm in (11.21 ). We show that 
under a very mild assumption, the solution to (ll.lt is unique 
and the sequence S^) converges to the solution. We will 
also discuss Theorem II. 11 the empirical algorithm and some 
implementation issues in this section. 

A. Uniqueness of the solution 

We first show the existence and the uniqueness of the 
solution to ( |I. It under a rather weak assumption. 

Theorem III.l. If for any linear subspace L we have 



\XC)L\ ^ dim(i) 



TV D 
then the solution of ( 11.11 ) exists and is unique. 



(III.l) 



Indeed, for real data sets that contain noise, (IIII. lb is usually 
satisfied if the dimension is smaller than the number of points: 
in noisy data set generally any d-dimensional linear subspace 
only contains at most d points. 

An important remark is that the condition (IIII. lb is in- 
compatible with the condition on the percentage of inliers in 
Theorem lI.il Indeed, the solution to ( 11.11 ) does not exists: one 
may verify that F ( tr (n^ J+ei) ) converges to — oo as e — >• 0, 



converges to a singular matrix where F(S) 



while ; 
is undefined. 

The proof of Theorem IIII. II depends on the following two 
lemmas, whose proofs will be presented in the appendix. In 
general, Lemma llll.2| guarantees the uniqueness of the solution 
and Lemma IIII. 31 guarantees the existence of the solution. 
While dIH.2l ) is also proved to l32l Proposition 1], additionally 
we show the condition for the equality in Lemma BlI. 21 which 
implies the uniqueness of the minimizer of ( II. lb . 

Lemma III.2. -F(S) is geodesically convex on the manifold 
S++(D). That is, for any Si and S 2 £ S++(D), we have 

F(Si) + F(S 2 ) > 2F(S? (Ei »E 2 Ei »)*E? )• (HL2) 

When spanjA"} = M. D , the equality in (IIII. 2b holds if and 
only if Si = cS 2 . 
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Lemma III.3. Under the condition fllll.ll ), we have 

F(S) -> oo as A min (S) -> 0. (IH.3) 

//ere A m ; n (S) w the smallest eigenvalue of S. 

Now we are ready to prove Theorem Hill I 

Proof: We first prove the uniqueness of the solution to 
dTTTb . If Si ^ S 2 are both solutions to CTJ, then apply ( 1111.2b 
and the scale invariance (II. 6b. we have 



F(E 3 ) < F(Si) = F(S 2 ), for 

s*(s-*s 2 s-*)*s? 
3 - — t^. — ~ — it- 

Since Si and S 2 are both minimizers to F(S), we have 
F(S 3 ) = F(Si) = F(Ei), by the condition of equality in 
Lemma [III. 21 (the assumption span{<Y} = M. D in Lemma [III. 21 
holds; otherwise dlll.ll ) does not hold for L = spa,n{X}), 
we have Si = cS 2 . However we have tr(Si) = tr(S 2 ), 
therefore Si = S 2 , which contradicts out assumption, and 
we prove the uniqueness of the solution to ( II. lb . 

Now we prove the existence of the solution. First, there 
exists a sequence {Si}i>i C {S E S++(D) : tr(S) = 1} 
such that F(Hi) converges to 

m ftr(s)=i,ses ++ (n) F(£). By compactness there is a con- 
verging subsequence of {S^}, and by Lemma IIII.3I this 
subsequence does not converge to a singular matrix and 
therefore the subsequence converges to some matrix So G 
S ++ (.D). By continuity of F(E) we obtain F(S ) = 
m ftr(£)=i,£es ++ (D) F(S) and therefore S is a solution to 

CD- ■ 

B. Convergence of algorithm 

In this section we prove the convergence of the sequence 
(S fe ) generated by ( II .21 under the assumption (IIH.lt . and 
we will also discuss its connection to Theorem 11.11 which 
is about the convergence of the sequence (S fc ) under another 
assumption. 

We begin with the motivation of the procedure (II. 2) . If we 
set the derivative of -F(S) with respect to S _1 to be 0, we 
have 

ds^( s ) = ^ ~Mf^ = 

xex 

and 

„ D \ - xx 



N ^ x T Y>-^x' 

xex 

Since we minimize F(S) under the assumption tr(S) = 1, 
we have 



S* — 



D_ xx 



X£X x T- £l - 1 X 



tr 



£ a 



jxex jTs-ij. ) tr yJ2xex x t s -i X/ 

whose RHS is the update formula dl.2t . Therefore we have the 
motivation that the 



Theorem IIII.4I shows that S' fc ^ converges to the solution 
to (11.11 ) under the assumption ( IIII.ll ). Similar to ll32l Section 



II], it uses the majorization-minimization argument. However 
the analysis here is more complete in the sense that it proves 
the convergence of the sequence S^ fc \ while the argument in 
ll32l only leads to the convergence of the objective function 
F(S«). 



Theorem III.4. When the condition ( IIII.ll ) holds, the sequence 
S' fe ) generated by (11.2b converges to the unique solution to 

CD. 

This theorem also implies that the condition ^^rM > in 

Theorem II. H is almost necessary. Indeed, if < -p, then 

the condition ( IIII.ll ) is usually satisfied, and by Theorem lIII.ll 
the solution to ( 11.11 ) exists (and by definition nonsingular) and 
by Theorem IIII.4I S^ fc ) converges to this nonsingular matrix. 
Therefore we can not recovery by its range. This also shows 
a phase transition phenomenon at = 

For simplicity in the proof we define the operator T : 
S+(D) -> S+(£>) as 



xx 



xex 



xex 



cc T S 



(III.4) 



The main ingredient for the proof is the observation that 
_ T^j^k)) can b e considered as a minimizer of 
a majorization function G(S,S( fe ') over -F(S) such that 
G(S,S«) > F(S) and G(S( fc ),S( fe )) = In 
this sense it can be considered as an algorithm with the 
majorization-minimization (MM) principle ATI . We remark 
that similar observations are also used in the proof of the 
convergence in other iteratively reweighted least square (IRLS) 
algorithms such as fl4l. Il6l. 11351. ITP71. 

When the condition in Theorem 11.11 holds, the assump- 
tion (1HI.U is violated, and by our analysis in Section lTlI-Al the 
solution to (II. lb does not exist. However, Theorem II. 1 1 shows 
that the sequence S' fc ' still converges and it converges to a 
singular matrix. Due to the complexity we put its proof in the 
appendix. 

C. Empirical algorithm and implementation issues 

Since the solution to dill ) can be considered a robust 
estimator of covariance, empirically we can simply recover 
the underlying e?-dimensional subspace by the span of its top 
d eigenvectors. Our empirical algorithm in summarized in 
Algorithm Q] 

In each iteration the major computational cost is due to 
the calculation of the inverse of S' fc ' and the calculation 
of ^2 xeX x T 'S^~ 1 x, therefore the cost is in the order of 
0(N LP) when N > L>. We will show later in Section HV^Cl 
that the algorithm exhibit linear convergence. In implementa- 
tion we stop the algorithm after fc-th iteration when 

M S (fc) _ s (k-i)i 



is«| 



^ < io- 8 . 



In this paragraph we describe an empirical problem where 
the algorithm breaks down at some iteration step, and describe 
a way to overcome it. If the condition in Theorem II. 1 1 holds, 
A m j n (S' fc )) converges to as k —> oo and it is nonzero for 
each k. However in implementation, due to the rounding error, 
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Algorithm 1 Empirical algorithm for recovering a d- 
dimensional subspace 

Input: X £ ~R D : data set, d: dimension of the subspace. 
Output: L*: a e?-dimensional linear subspace. 
Steps: 

• Initialization: S (0) = I. 

• Repeat (l)-(2) until convergence: 
1) k = k + 1, 

2) 



N Q =100, D=10, d=5 



a; a; 



» Let E* be the limit of the sequence EW, and let L* 
be the span of top d eigenvectors of X*. 



when fc is very large and A m i n (E( fe )) is very close to zero, the 
calculated X' fe ) _1 could be a non-positive matrix or a matrix 
with imaginary part and the convergence of Algorithm Q] 
fails. Therefore in implementation we check the value of 
min xe ^- x T Yi^ ~ x x in each iteration, and stop the algorithm 
when it is negative or has imaginary part. We remark that 
this breakdown will not happen for real data sets or synthetic 
data sets with noise, since in these cases converges to 
a nonsingular positive matrix and the rounding error will not 
make YP^ _1 a non-positive matrix or a matrix with imaginary 
part. 

D. Discussion on spherical projection 

A simple and powerful method to enhance the robustness 
of an algorithm to outliers is to preprocess the data set 
by projecting the data points to a unit sphere. Empirically 
it enhances the robustness of PCA and Reaper algorithms 
significantly ifTTl Section 5]. Therefore a natural question is 
that whether it can also be applied in our algorithm. 

Interestingly, spherical projection has been implicitly ap- 
plied in the objective function F(H) and our algorithm: one 
may verify that the magnitude of any point in X does not 
impact the solution to (II. U . or the update formula (II. 2\ . 
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Fig. 1. The dependence on the number of inliers and recovery error: x-axis 
is the number of inlier and y-axis is the corresponding recovery error 



B. Exact recovery of the subspace 



IV. Numerical Experiments 

In this section, we present some numerical experiments 
on Algorithm [T] to obtain the empirical performance of this 
algorithm. We also show that our algorithm outperforms other 
convex algorithms of robust PCA by a real data set. 

A. Model for simulation 

In Sections HV-BIIIV-Dl we apply our algorithm on the 
data generated from the following model. We choose a d- 
dimensional subspace L», sample Ni points i.i.d. from the 
Gaussian distribution iV(0, Ux.) on L*, and sample An 
outliers i.i.d. from the uniform distribution in the cube [0, l] D . 
In some experiments we also add a Gaussian noise A(0,£ 2 I) 
to each of the point. We use uniform distribution in [0, 1] D for 
outliers, to show that our algorithm allows anisotropic outliers. 



In this section we use the model in Section IIV-AI choose 
D = 10 or 50, d = 5, N = 100 and different values of N 1 
(2 to 20 for D = 50 and 80 to 120 for D = 10). The mean 
recovery error ||Il£ — TLl, \\f over 20 experiments is recorded 
in Figure Q] where L is obtained by the Algorithm Q] and L* 
is the true underlying subspace. Theorem |TT] guarantees that 
, || F = for Ni > 100 when D = 10 and Ni > 10 
50, and this is verified in this experiment. When 
D = 50 and JVi = 11 there is a small nonzero recovery error, 
which seems to contradicts Theorem [TT| But we remark that 
when D = 50 and N\ = 11 the convergence is slow, and 
we stop the algorithm at the 1000-th iteration without really 
converging to the solution to ( II. It . We expect that exact recover 
of the subspace L* could be obtained after larger number of 
iterations in Algorithm Q] 



l|n £ -n L 

when D = 



6 



N Q =100, D=10, d=5 




Number of iterations 

Fig. 2. Convergence rate for simulated data sets. See the text in Section \lV-C\ 
for more details of the experiment. 



C. Convergence rate 

In this section we show that empirically the algorithm 
converges linearly. In the left figure in Figure |2] we show 
the convergence rate for simulated data sets with D = 10, 
d = 5, N = 100 and N x = 80, 100, 120. Additional we 
add a Gaussian noise with e = 0.01. The a>axis represent the 
number of iterations k and the y-axis is ||S( fc ) — E^Hf, 
where K is the total number of iterations in Algorithm Q] In 
the right figure we show a different convergent rate: for two 
different simulations with no noise we plot \\U.L k ~ IIlJ|f 
with respect to the number of iterations k, where is the 
span of first d eigenvectors of X' fe ). We use the settings 
{N u N ,D,d) = (120,100,10,5) and (20,100,50,5) since 
by Theorem UTIlimfc^oo \\TlL k — ^-l, \\f = 0. From the right 
figure in Figure [2] we see that the recovery error also converges 
linearly. 




1CT 6 10"' W' 2 10° 

Size of noise e 



Fig. 3. Robustness to noise: the x-axis represents the size of Gaussian noise 
e, and the y-axis represents the recovery error. 



D. Robustness to noise 

In this section we investigate the empirically robustness of 
our algorithm to noise by simulated data set sampled according 
to section H^A] with (Ni,N ,D,d) = (120,100,10,5), and 
different size of noise e. We use this setting since when e = 0, 
we recover the subspace exactly. We record the recovery error 
in Figure IIV-DI with respect to the size of noise. In this 
experiment the recovery error depends linearly on the size 
of noise e. Indeed, we consider a theory that explain the 
performance of Algorithm Q] under some noise of small size 
as an interesting future question. 

E. Faces in a Crowd 

In this section we test our algorithm on the experiment of 
"Faces in a Crowd" in ifTTl Section 5.4]. 

The goal of this experiment is to show that our algorithm 
can be used to robustly learn the structure of face images. 
Linear modeling is applicable here since the images of the 
faces from the same person lies on a 9-dimensional sub- 
space Q. In this experiment we learn the subspace from a 
data set that contains 32 face images of a person from the 
Extended Yale Face Database lfl6ll and 400 random images 
from the BACKGROUND/Google folder of the CaltechlOl 
database Q. The images are converts to grayscale and down- 
sample to 20 x 20 pixels. We preprocess the images by 
subtracting their Euclidean median, apply Algorithm Q] to this 
data set to obtain a 9-dimensional subspace, and we use 32 
other images from the same person to test how the learned 
subspace fits these images. 

This experiment is also used in IfTTl Section 5.4], therefore 
we only compare our algorithm with S-Reaper, which has 
been shown to perform better than PCA, spherical PCA, LLD 
and Reaper algorithms. PCA algorithm is still included for 
comparison since it is the basic technique in linear modeling. 
Figure IIV-EI shows five images and their projections to the 
9-dimensional subspace fitted by PCA, S-reaper and our algo- 
rithm (which is labeled as "M-estimator" due to the argument 
in Section II-DI) respectively. Figure IIV-EI shows that our 
algorithm visually performs better than S-Reaper, especially 
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Fig. 4. The projection of images to the fitted subspace. 
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S-reaper 
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Distance to the PCA subspace 



Fig. 5. Ordered distances of the 32 test images to the fitted 9-dimensional 
subspaces by Algorithm [7J S-reaper and PCA. 



for the test images. This observation can also be quantitatively 
verified by checking the distances of 32 test images to the fitted 
subspace by PCA, S-reaper and out algorithm, which is shown 
in Figure IIV-EI The subspace generated by our algorithm has 
smaller distances to the test images, which explain the better 
performance of our algorithm in Figure IIV-EI 

Besides, in this experiment our algorithm performs much 
faster than S-Reaper; our algorithm costs 4.4 seconds on 
a machine with Intel Core 2 Duo CPU at 3.00GHz and 
6GB memory, while S-reaper cost 40 seconds, it is expected 
since there is an additional eigenvalue decomposition in each 
iteration of the S-Reaper algorithm. 

V. Discussion 

In this paper we have investigated an M-estimator for 
covariance estimation, and proved that this estimator can 
find the underlying subspace exactly under a rather weak 



assumption. We also demonstrated the virtue of this methods 
by experiments on simulated data sets and real data sets. 

An open question is that, if we can have a theoretical 
guarantee on the robustness of our algorithm to noise and 
therefore verify the empirical performance in Section IIV-DI 
We find it difficult to apply the commonly used perturbation 
analysis in |[35l Section 2.7] or l34l Theorem 2], which are 
based on the size of the perturbation of the objective function, 
since the objective function F(H) at a singular matrix is 
undefined. 

An interesting direction is to extend the idea of geodesi- 
cal convexity to other problems. Euclidean metric between 
matrices is usually used and under this metric the set of all 
positive definite matrices is considered as a cone. However 
in this work we consider the set of all positive matrices as a 
manifold and use the Riemmannian metric between matrices. 
It turns out that while F(E) in nonconvex in Euclidean metric, 
it is convex in Riemmannian metric, and this formulation is 
more powerful than similar formulations that are convex in 
Euclidean metric [35), ifTTl . It would be interesting if there are 
other optimization problems with the property of geodesical 
convexity. 
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VII. Appendix 

A. Proof of Lemma XIII. 2\ 

Proof: Geodesical convexity of F(H) follows from jIII.21 ) 
and Lemma IlLTl Therefore we only need to prove (IIII.2t for 
geodesic convexity. 

We will prove (1111.21 ) by showing that, if £ 3 G S ++ (D) is 
the geometric mean of Si, £2 G S++(D), then we have 



ln(dct(Ei)) + ln(dct(S 2 )) = 21n(det(S 3 )), (VII. 1) 



and 



ln(a; T Sia;) + ln(a; T S 2 a;) > 2 ln^I^a;) 



(VII.2) 



We start with the proof of ( IVII.U . Use (1II.21 > with t = |, 
we have 

=S 2 . (VII.3) 

Using ( IVII.3I ). ( IVII.ll ) can be proved as follows: 

det(E 2 ) = dct(S 3 Sr 1 S 3 ) = det(E 3 ) dct(Sf 1 ) det(£ 3 ) 
= dct(£ 3 ) 2 /dct(S 1 ). 



To prove dVII.2t . we let the SVD decomposition of 
Sj 2 S 2 S 1 2 = UqYIoUq and define x = UqH^x, then we 
have x T T,iX = x T x, x T Y,2X = x T H^x, and x T Y,^x = 



8 



Assuming that X is a diagonal matrix with diag- Let A 



onal entries ai,(X2, - ■ ■ , er p and i: = (xi , £2 , 
( I V1I-21 > is equivalent to 



, then 



p 

E aii * 



£*?>(5>?tf) a 



which can be verified by Cauchy-Schwartz inequality. There- 
fore ( IVII.21 i is proved. 

Finally we find the condition such that the equality in flHI.2| > 
holds. By its proof of geodesic convexity we know that it holds 
only when the equality d VII.2| > holds for any x E X. 

By the condition of equality in Cauchy-Schwartz inequality, 
we have that the equality in ( IIII.2I ) only holds when for any 
1 < i < D (here i is the index of coordinates) such that 
ii 7^ 0, Oj = c for some cel. When Si ^ cX 2 , ct; is not 
the same number for all 1 < i < D. Therefore there exists 
1 < i < D such that Xi = 0. That is, there exists a hyperplane 
in R D such that x lies on it. Since x is a linear transformation 
of x, when dVII.2t holds for any x £ X, then there exists a 
hyperplane such that it contains x E X, which contradicts our 
assumption that spanjA"} = R D . ■ 



B. Proof of Theorem \III.4\ 

First we will prove that the operator T is monotone with 
respect to the objective function F: F(T(X)) < F(X), and 
the equality holds for X £ S ++ (D) only when T(X) = X. 

We prove it by constructing the following majorization 
function over F(X): 



T 
X X 



C- 1 ^ + -^logdct(S)+C. 

(vn.4) 

When C is well chosen such that G(X*, X*) = F(X*). The 
fact 

G(X,X*) > F(X) 

can be proved by checking the first and the second derivative 
of G(X, X*) - F(X) with respect to X" 1 . 

It is easy to verify the unique minimizer of G(X, X*) is 



F 



iV ^ x T X* 



E 



l 03 



which is a scaled version of T(X*). Therefore we prove the 
mono tonicity of T as follows: 

F(T(X*)) = F(X) < G(X, X*) < G(X*, X*) = F(X*). 

(vn.5) 

Because of the uniqueness of the minimizer of G(X, X*), the 
equality in the second inequality of (IVII.5I ) holds only when 
X = X*. Since X = cT(X*) and tr(X*) = tr(T(X*)) = 1, 
the equality in dVIOb only holds when T(X*) = X*. 

Therefore the sequence F{H^) is monotone, and any 
accumulation points of the sequence {X^}, X, satisfies 
F(T(X)) = F(X) and therefore T(X) = X. 

Applying T(X) = X, we have 



T 
X X 



x T S- 



cl, for some c £ 



(VII.6) 



log(X : ), applying logdct(X) = — tr(A) and 
I = cxp(A), the derivative of F(X) with respect 



to A is 



dA V ; N 



1 \ " 



Since the set {A : A = log(X _1 ), where det(X) = 1} = 
{A : tr(A) = 1}, applying dVH.6l > the derivative of F(X) 
with respect to A in the set {X : det(X) = 1} is at CqX, 
where cq is a number chosen such that det(coX) = 1. Since 
both the set dct(X) = 1 and F(X) are geodesically convex 
(see ( I VII. II ) for the convexity of the set), coX is the unique 
minimizer of F(X) in the set {X : dct(X) = 1}. Applying 
the scale invariance of F(a) in ( 11.61 ), X is the unique solution 
in the set {X : tr(X) = 1}, which means that X is also the 
unique solution to dl.lt . 



C. Proof of Lemma MIL 3 1 

Proof: If Lemma IIII.3I does not hold, then there exists 
a sequence X m such that it converges some X £ S + (D) \ 
S ++ (D), and the sequence F(X m ) is bounded. WLOG we 
assume that Aj(X mi ) and Vj(X mi ) also converge for any 1 < 
j < p, where Aj(X) and Vj(H) are the j-th eigenvalue and 
eigenvector of X. This can be assumed since any sequence 
has a subsequence satisfying this property (eigenvectors and 
eigenvalues of X m lie in a compact space). 

We prove (lIII.3t by induction on the ambient dimension D. 
When D=2, we have dim(kcr(X)) = 1, and 

F(X m ) (VII.7) 
^ E (log(A 2 (X m ))+21og( : r T « 2 (X m ))) 

+ilog(A 2 (X m )) + ilog(A 1 (X m )). 

When x ^ ker(X), we have liminfm^oo a; T W2(X m ) > 0, 
therefore the term log(a; T t>2(X„ l )) is bounded from below. 
Applying the assumption that Ai(X m ) are bounded from 
below, i log(A 1 (X„ l )) is also bounded from below. Applying 
the assumption L^A^iM! > I an( j linim^oo A2(X m ) = 0, the 
RHS of dVII.7t converges to +00, which is a contradiction to 
the assumption that F(X m ) is bounded, and therefore (lHI.3t 
is proved. 

If dill. 3b holds for the case dim(;c) < Dq, then we will 
prove (lIII.3t for dim(a;) = Dq. By the assumption on the 
convergence of eigenvectors and eigenvalues of X^ fc ', to prove 
dIII.3t it is equivalent to prove that 



F(x; 



00 as m — > 00, 



(VII.8) 



L x 



where X^ = PjX m P £ , L = ker(X), d Q = dim(L) and 
F' : S ++ (d ) ^ M is defined by 

F 'W = 7j E ^((Plxf^Plx) + i-logdet(X). 
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An important observation is that lim 

A 

A 

d n \X\L\ 



Combine it with > 4-, we have 

lim F'(£'J-F'(£;j = (^_ 



tr(SJ n ) = 0. By checking the trace of the numerator of the LHS in (IVII. 1 3b 
and dVH.141 i we have 



D, 



N 



) lim logtr(S:„) 

m— too 

(Vn.9) and 



E 

x£X 2 



Plxx T P L , 



\PlM 



ll ld 



(VII. 15) 



When £' n converges to a nonsingular matrix X', 

lim F'CS' ) = F'(S') = C (VII.10) 

m— >-oo 

for some constant C, and when S' m converges to a singular 
matrix, by induction 



lim F'(£'J = oo. 



(VII. 11) 



Combining dVIL9b , ( IVII. 101 ) and ( IVII. 111 ). dVTl8b is proved 
and therefore Lemma 1111.31 is proved by induction. 



D. Proof of Theorem \L1\ 

The roadmap of the proof is as follows. We denote the set 
of outliers by Xq = X \ i* and let X\ = X n L*, and let 
iVi = l^i | and N = \X Q \. Assume that the solutions of ( lITTb 
for the set and 3^o are Id /d and lo-d/ {D — d) respectively, 
then we will prove that 

1 L „ (VII. 12) 



lim = -n, 

k— >oo a 



which implies Theorem lI.il 

WLOG we can make these the assumptions on the solutions 
of ( II. II ) for the set J^i and since the points in y>i and 3^o lie 
in general positions, and applying Theorem Mill the solution 
to dill ) for the set y% and 3^o are nonsingular. Assuming 
the solution of dill ) for the set J^i and 3^o are Si and S2 
respectively, then the following set X', which is a linear 
transformation of X, satisfies that the solution to (11.11 ) for 
the set y[ and y' 2 (they are generated by X') are Lj/d and 
I D -d/(D-tI): 



X' = {y Sin L> a; + \J'E2U. L ±x :xeX}. 

If the algorithm in (II. 2b for A"' converges to gllx, then 
by linear transformation the algorithm for X converges to 
Pl^YixP^ , whose range is also in L*. Therefore to prove 
Theorem II II we only need to prove (IVII. 12b . 

Now we start to prove (IVII. 121 . Using the update formula 
in dl.2b and the assumption that the solutions of dllb to J^i 
and yo are Id/d and Id_<j/(-D — d) respectively, we have 



E 



xeXi 



II ft. HI 



tr 



(VII.13) 



and 



xx 1 P T 



xGXo \\P L 



1 



ME 



C T P , 



(VII. 14) 



E 



P"[ x XX T P L i 



\Xo\ 



No 



D 



l-D-d 



D 



Applying dVII. 15b and d VII. 16b we have 

,...„T 



(VII. 16) 



XX 



xEX 



1 a; 



>a-L(^, E 



a; a; 



=^A mJn (Pf.EP £ .), 



and 



x iP T 



iVn 



Amax(-P i i Sf^x ] 
(^xSP^i). 



D-d 

Combining them with the definition of the operator T in dIII.4b . 
we have 

A min (PjT(S)PiJ A min (PjSP L J 



> a- 



A max (P^T(£)P L x) " A^fP^EPn)' 



where a 

\xnL,\ d_^ 

\X\ ^ D>- 

Therefore 



jVi(-D-d) 
N d 



> 1 (it follows from the assumption 



lim 

k— >oo A 
=00. 



A min (P£E< fc >P L . 



[(P T s(fc)p ) - fc 



> lim a" 



_ t A min (Pf t S( 1 )P L J 
A max (P71S(DP i x) 



(VII. 17) 

Since tr (£(*)) = 1 for all k > 0, we have 
lim A max (PfxS (fc) P L x) = 0, and lim P^£«P L x = 

k— too * * k^-oo * * 

(VII 18) 

Combining ( IVII. 1 8b and the fact that X( fe ) is positive semidef- 
inite, 



lim Pj S (fc) P r x =0. 



(VII. 19) 



Since we already obtained ( IVII. 1 8b and (IVII. 19b , in order 
to prove ( IVII. 12b we only need to prove that P^ £Pl„ 
converges to Id/d. Applying ( IVII. 15b we have 



Amax (Pl, 'E 



XX 



<E^« 

x£Xo 



x£X 



T 



XX 



T _ 

L 'x T T,- 1 



Pjj + A 



c(V Pi 



XX 



Pl) 



<A max (PjxSP L x) 



xeXn 



iFir 



^A max (P££P L J 
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and 

An 



> A 



(VP 



Therefore 



mm \ / 



L : ' 



therefore for any E\ > 0, there exists some n > such that 

A majc (P^T«"(S)P i , 



>^A ■ <» T 
a 



A max (P^T(£)P L J 



x(PJSPl.) 

A min (Pj,T(£)P L J " A min (P££P L J 



< 



dA max (Pj x EPix)E 



11° 

xGAT IIP, 



(VII.20) 



^iA min (p/;sp L j 

Now we will prove that 

A min (P££ (fe) P L J > c for all fc > 1 for some c> 0. 

(VII.21) 

If JVII.2 lb does not hold then there exists a subsequence S fcj 
such that 

Um fcj _>„ A mi n(P£ Sfe) Pl, ) = 0. Applying (TVILT81) . 
(|VII.191 l and the induction argument in the proof of 
Lemma jlll.31 we have lirrifc ^(S^"^) = oo, which con- 
tradicts the monotone property of the algorithm in (IVII.5) . 
Therefore (1VII.2U is proved. 

Applying (IVII.17I ), there exists a constant C > such that 

A max (P^£«P^) < CcT k . (VII.22) 

Now we prove the existence of lim^oo A m " P T' S (t) Pl ') ■ 
If it does not exist, then there exists e > such that for any 
sufficiently large Kq, there exists k\ > &2 > Kg such that 



A max (P£ £( fe i)P Lt ) A max (P£ S^)Pi, 



A min (P2;s(fei)P L J A min (P££fe)P L ,: 



> £. 



Summing dVII.20t for £ = £( fc2 ), E^ 1 ), • • • , SO 1 " 1 ), and 
apply (IVII.2U and dVII.221 ) we have the contradiction for 
sufficiently large Kq. 
Next we will prove 



lim 



t(P T S ( fc )p £j 

fcTi A min (P^S( fc )P L J 



by 
liim 



contradiction, 



i.e., 



by 



(VII.23) 
assuming 



ifc^oo Amin (Pf swfi j = co > 1. Since the sequence 
S' fe ) lies in compact space, there is a subsequence {E ( - fc '" ) }j>i 
converging to X with 

Ama-jcCP/^ SPl» 



A min (P££P L J 



= c > 1. 



(VII.24) 



Applying dVII.18t and dVII.191 ) we have n L ,Sn L , = £. By 
simple calculation this property also holds for T™(£) for any 
n. Therefore the update T"(S) can be considered as a update 
only depends on the set 34- Then by using Theorem IIII.4I to 
the set y± we have 

lim T n (t) = -U Lt , 

n— >oo ci 



A min (P^T«o(S)P L J 



< 1 +£l. 



(VII.25) 



Using the continuity of the mapping T"°, for any n > there 
exists £2 > such that 

\ max (P[T n « (±)P L , ) A max (P£ T™° (S)P L , ) 



A mi „(P£T»« (E)P L . ) A min (P£T™° (S)P L , ) 



< »7, 

(VII.26) 
when ||S - S|| < e 2 . 

Choose jo large enough such that ||S( fc Jo) — S|| < £2, then 
applying dVH.251 ) and dVII.261 ) with £ = we obtain 

A max (Pf S^o+»o Pi j 



A min (P^S^o+»op L j 



< 1 + £1 + 77. (VII.27) 

Summing ( IVII.20b with £ = S fc for all k > k jo + n , 
applying ( IVII.2U and dVII.221 ) we obtain that for some C\ > 
0, 



<- 



co 

An 



lim 

k—t-oc \ 



A max (Pf S( fe )p L , 



(Pj.sWPi 

(Pf S^D+^oPiJ 



(VII.28) 



<1 + da 



Ei+rj. 



Since we can choose £1, 77 arbitrarily small and fc JO , no 
arbitrarily large, (lVII.28t is a contradiction to (IVII.24) . There- 
fore dVII.231 l is proved. Combining (IVII.231 > with (lVII.18b and 
( lVII.19b and notice that tr(S (fc) ) = 1 for all k > 0, we proved 
(IVII.12I ). 
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